White Wine Analysis by Nair

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

There are 4898 observations with 12 features (the 13th column, X, is just a row index). Descriptions of the features can be found at https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt

I am interested in analysing the quality variable, which is scored on a scale of 0-10, with 10 being the highest-quality white wine. The quality scores in this dataset range over [3, 9], with a mean of 5.878 and a median of 6. Tabulating quality as a factor gives:

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

From this table we can see that most white wines have quality scores of 5, 6 or 7. To get a general idea of the data, I generate histograms for all the features.

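As we can see, most of the distributions are approximately normal, while a few have a long right tail (for example residual.sugar, whose mean of 6.391 is well above its median of 5.2). To check for outliers in the data, visualising the same data using boxplots will help.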
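One quick way to check the direction of skew without the plots is to compare each feature's mean with its median. The sketch below uses only the values printed by summary() above. The mean-versus-median comparison is a rough heuristic, not a formal skewness test, and the column names `relative_gap` and `tail` are made up for this illustration.

```r
# Mean and median copied from the summary() output above (not recomputed).
stats <- data.frame(
  feature = c("residual.sugar", "chlorides", "free.sulfur.dioxide",
              "volatile.acidity", "pH"),
  mean    = c(6.391, 0.04577, 35.31, 0.2782, 3.188),
  median  = c(5.200, 0.04300, 34.00, 0.2600, 3.180)
)

# Heuristic (an assumption, not a formal test): a mean above the median
# suggests a longer right tail, a mean below it a longer left tail.
stats$relative_gap <- (stats$mean - stats$median) / stats$median
stats$tail <- ifelse(stats$relative_gap > 0, "right", "left")

# Largest gap first.
print(stats[order(-stats$relative_gap), ])
```

For these five features the mean sits above the median in every case, with residual.sugar showing the largest relative gap.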

There seem to be outliers spread throughout the feature values for white wine, but at this point it is hard to say whether they are errors in the dataset or the actual values for those white wines.
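To make the boxplot reading concrete, the conventional 1.5 x IQR rule can be applied to the quartiles that summary() printed above. This is only a sketch on three features. It checks whether each maximum lies beyond the upper fence and does not count how many points do, which would need the full data.

```r
# Quartiles and maxima copied from the summary() output above.
q <- data.frame(
  feature = c("residual.sugar", "chlorides", "free.sulfur.dioxide"),
  q1  = c(1.7, 0.036, 23),
  q3  = c(9.9, 0.050, 46),
  max = c(65.8, 0.346, 289)
)

# Tukey's rule: values above Q3 + 1.5 * IQR are flagged as outliers.
q$upper_fence <- q$q3 + 1.5 * (q$q3 - q$q1)
q$max_is_outlier <- q$max > q$upper_fence

print(q)
```

In all three cases the maximum is far above the fence, which is consistent with the outliers visible in the boxplots.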

Univariate Plots Section

I create a new quality rating based on 3 levels, ‘Good’, ‘Average’ and ‘Mediocre’.

wv$rating <- ifelse(wv$quality > 7, 'Good', ifelse(wv$quality <= 4, 'Mediocre', 'Average'))

I then order the ratings.

wv$rating <- ordered(wv$rating, levels = c('Mediocre', 'Average', 'Good'))

Here is the summary of the ratings variable:

summary(wv$rating)
## Mediocre  Average     Good 
##      183     4535      180
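As a cross-check, the same three bins can be expressed with cut() instead of a nested ifelse(). The sketch below rebuilds the quality scores from the frequency table shown earlier, so it runs without the CSV. The vector `quality` is a stand-in for wv$quality, and cut() is an alternative, not the code used in the original analysis.

```r
# Rebuild the quality scores from the frequency table printed earlier
# (a stand-in for wv$quality so the snippet is self-contained).
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Same bins as the nested ifelse(): (-Inf, 4], (4, 7], (7, Inf].
rating <- cut(quality,
              breaks = c(-Inf, 4, 7, Inf),
              labels = c("Mediocre", "Average", "Good"),
              ordered_result = TRUE)

print(summary(rating))
```

The counts come out as 183 Mediocre, 4535 Average and 180 Good, matching the summary above.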

Visualization of the wine rating categories:

I now transform Volatile Acidity, Citric Acid, Chlorides and Free Sulfur Dioxide to their log values:
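One caveat with a log transform is that citric.acid has a minimum of 0 in the summary above, and log10(0) is -Inf. The sketch below shows the problem on three values taken from that summary and one possible workaround. The offset of 0.01 is a hypothetical choice for illustration, not something the original analysis used.

```r
# Minimum, median and maximum of citric.acid, copied from summary() above.
citric <- c(0.00, 0.32, 1.66)

# log10() of zero is -Inf, which propagates into means and correlations.
print(log10(citric))

# One common workaround (an assumption, not the original analysis):
# add a small constant before taking the log.
offset <- 0.01  # hypothetical value; should be justified for real data
citric_log <- log10(citric + offset)

print(all(is.finite(citric_log)))
```

Plotting on a log axis instead of transforming the column would hit the same issue, so zero values need a deliberate decision either way.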

Univariate Analysis

What is the structure of your dataset?

There are 4898 observations with 12 features. The features are: fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol and quality. The description of the attributes is as follows:

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

Output variable (based on sensory data): 12 - quality (score between 0 and 10)

What is/are the main feature(s) of interest in your dataset?

I am interested in analysing the quality variable, which is scored on a scale of 0-10, with 10 being the highest-quality white wine. The quality scores in this dataset range over [3, 9], with a mean of 5.878 and a median of 6.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I used this link (http://winefolly.com/review/understanding-acidity-in-wine/) to understand the structure of wines and how to assess their quality. Based on that reading, the acidity (wv$pH), sweetness (wv$residual.sugar) and alcohol content (wv$alcohol) are the main driving factors in understanding and assessing the quality of wine.

Did you create any new variables from existing variables in the dataset?

To better understand the rating of the quality of the wine, I have classified the wine ratings into three categories: 0-4 is ‘Mediocre’, 5-7 is ‘Average’ and 8-9 is ‘Good’. I have used the official wine 100 point rating scale as inspiration. Based on that metric we have 183 Mediocre, 4535 Average and 180 Good category wines.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

To get a better scale on the values of Volatile Acidity, Citric Acid, Chlorides and Free Sulfur Dioxide, I converted them to their log values.

Bivariate Plots Section

To perform my bivariate analysis, I create plots to check for clearly visible relationships between the features I expect to be related.

I start by exploring the relationship between Quality and pH

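library(ggplot2)  # needed for ggplot(); loaded here in case it was not loaded earlier

# Jitter horizontally only, so the integer quality scores do not overplot.
ggplot(data = wv, aes(x = quality, y = pH)) +
  geom_point(alpha = 1/5, position = position_jitter(height = 0)) +
  xlab('Quality of wine') +
  ylab('pH of wine')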

Exploring the relationship between Quality and Sweetness

# Refer to columns by name inside aes(); the data frame is passed via `data`.
ggplot(data = wv, aes(x = quality, y = residual.sugar)) +
  geom_point(alpha = 1/5, position = position_jitter(height = 0)) +
  xlab('Quality of wine') +
  ylab('Residual Sugar')

Exploring the relationship between Quality and Alcohol

# Refer to columns by name inside aes(); the data frame is passed via `data`.
ggplot(data = wv, aes(x = quality, y = alcohol)) +
  geom_point(alpha = 1/5, position = position_jitter(height = 0)) +
  xlab('Quality of wine') +
  ylab('Alcohol content')

Exploring the relationship between Quality and Volatile Acidity

# Refer to columns by name inside aes(); the data frame is passed via `data`.
ggplot(data = wv, aes(x = quality, y = volatile.acidity)) +
  geom_point(alpha = 1/5, position = position_jitter(height = 0)) +
  xlab('Quality of wine') +
  ylab('Volatile Acidity')

Because it is hard to draw firm conclusions from these scatter plots, I next examine the Pearson correlation coefficient (r) between quality and each of the other features.

Based on the correlation values, I find that alcohol, density and chlorides have the largest absolute r values.
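The ranking described above could be computed along these lines. This is a minimal sketch, not the original code. The helper name `rank_by_correlation` is hypothetical, and the toy data frame is invented purely so the snippet runs, so it is not the wine data and its r values mean nothing.

```r
# Rank numeric features by the absolute Pearson r with a target column.
rank_by_correlation <- function(df, target = "quality") {
  numeric_cols <- names(df)[sapply(df, is.numeric)]
  features <- setdiff(numeric_cols, target)
  r <- sapply(features, function(f) {
    cor(df[[f]], df[[target]], method = "pearson")
  })
  r[order(abs(r), decreasing = TRUE)]
}

# Toy rows, invented for illustration only (NOT the wine data).
toy <- data.frame(
  quality = c(5, 5, 6, 6, 7, 8),
  alcohol = c(9.0, 9.4, 10.1, 10.5, 11.8, 12.6),
  density = c(0.998, 0.997, 0.996, 0.994, 0.993, 0.991),
  pH      = c(3.20, 3.05, 3.30, 3.10, 3.25, 3.15)
)

print(round(rank_by_correlation(toy), 3))
```

On the real data frame the row index X would need to be dropped first, for example by passing `wv[, setdiff(names(wv), "X")]`.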

I now create 3 plots to understand these relationships better by studying how the quantiles of each feature change with quality.

In the next 3 plots, the blue line is the 10% quantile, the green line is the 50% quantile (the median), the red line is the 90% quantile, and the black line is the mean.
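The numbers behind such lines are per-quality-level quantiles and means. A base R sketch of that computation follows. The function name `quantile_lines` and the toy vectors are assumptions made for illustration and are not part of the original analysis.

```r
# For each distinct x (e.g. a quality score), compute selected quantiles
# and the mean of y (e.g. alcohol).
quantile_lines <- function(x, y, probs = c(0.1, 0.5, 0.9)) {
  groups <- split(y, x)
  out <- t(sapply(groups, function(v) {
    c(quantile(v, probs = probs, names = FALSE), mean(v))
  }))
  colnames(out) <- c(paste0("q", round(probs * 100)), "mean")
  out
}

# Toy values, invented for illustration only.
x <- c(5, 5, 5, 6, 6, 6)
y <- c(1, 2, 3, 4, 5, 6)

print(quantile_lines(x, y))
```

Each row of the result corresponds to one point on each of the four lines. In ggplot2 a similar effect is usually obtained with a summary stat, for example `geom_line(stat = "summary", fun = quantile, fun.args = list(probs = 0.1))` in recent versions.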

Now I perform a similar analysis for the features I expected to matter for quality, pH and residual.sugar:

Lastly, I analyse relationships between features that are not our primary features.

Fixed acidity and Volatile acidity:

I analyse the r score:

with(wv, cor.test(as.numeric(fixed.acidity), as.numeric(volatile.acidity), method = 'pearson'))

Fixed acidity and citric acid

I analyse the r score:

with(wv, cor.test(as.numeric(fixed.acidity), as.numeric(citric.acid), method = 'pearson'))

Volatile acidity and citric acid

I analyse the r score:

with(wv, cor.test(as.numeric(volatile.acidity), as.numeric(citric.acid), method = 'pearson'))

Free sulfur dioxide and total sulfur dioxide

I then analyse the r score

with(wv, cor.test(as.numeric(free.sulfur.dioxide), as.numeric(total.sulfur.dioxide), method = 'pearson'))

Fixed acidity and pH

I analyse the r score:

with(wv, cor.test(as.numeric(fixed.acidity), as.numeric(pH), method = 'pearson')) 

Volatile acidity and pH

I analyse the r score:

with(wv, cor.test(as.numeric(volatile.acidity), as.numeric(pH), method = 'pearson'))
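For readers unfamiliar with cor.test(), the sketch below shows how the pieces of its result can be pulled out. It uses only the first ten rows of fixed.acidity and pH as printed in the str() output at the top, so the r it reports is for those ten rows and will not match the full-data value quoted later.

```r
# First ten rows of two columns, copied from the str() output at the top.
fixed_acidity <- c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1)
pH            <- c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22)

ct <- cor.test(fixed_acidity, pH, method = "pearson")

# Pull out the pieces that are usually reported.
result <- c(r       = unname(ct$estimate),   # sample correlation
            p_value = ct$p.value,            # test of r = 0
            ci_low  = ct$conf.int[1],        # 95% confidence interval
            ci_high = ct$conf.int[2],
            df      = unname(ct$parameter))  # n - 2

print(round(result, 3))
```

With only ten observations the confidence interval is wide, which is one reason the full 4898 rows are used in the analysis itself.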

I now recompute the r scores after log-transforming each feature (taking log10), as follows:

with(wv, cor.test(as.numeric(quality), log10(as.numeric(pH)), method = 'pearson'))
with(wv, cor.test(as.numeric(quality), log10(as.numeric(volatile.acidity)), method = 'pearson'))
with(wv, cor.test(as.numeric(quality), log10(as.numeric(residual.sugar)), method = 'pearson'))
with(wv, cor.test(as.numeric(quality), log10(as.numeric(chlorides)), method = 'pearson'))
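# Note: citric.acid contains zeros (min 0 in the summary), so log10() returns -Inf for those rows and this r is not usable.
with(wv, cor.test(as.numeric(quality), log10(as.numeric(citric.acid)), method = 'pearson'))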
with(wv, cor.test(as.numeric(quality), log10(as.numeric(density)), method = 'pearson'))
with(wv, cor.test(as.numeric(quality), log10(as.numeric(sulphates)), method = 'pearson'))
with(wv, cor.test(as.numeric(quality), log10(as.numeric(total.sulfur.dioxide)), method = 'pearson'))
with(wv, cor.test(as.numeric(quality), log10(as.numeric(free.sulfur.dioxide)), method = 'pearson'))
with(wv, cor.test(as.numeric(quality), log10(as.numeric(fixed.acidity)), method = 'pearson'))
with(wv, cor.test(as.numeric(quality), log10(as.numeric(alcohol)), method = 'pearson'))
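The eleven calls above follow one pattern, so they could also be written as a loop that reports the raw and log10 correlations side by side. The sketch below is one way to do that, under assumptions. The helper name `compare_raw_and_log` is hypothetical, the toy data frame is invented and is not the wine data, and skipping features with non-positive values is a design choice of this sketch.

```r
# Compare Pearson r on raw values with r on log10-transformed values.
compare_raw_and_log <- function(df, target = "quality") {
  features <- setdiff(names(df), target)
  out <- data.frame(feature = features, r_raw = NA_real_, r_log10 = NA_real_)
  for (i in seq_along(features)) {
    x <- df[[features[i]]]
    out$r_raw[i] <- cor(df[[target]], x)
    # log10() is only finite for strictly positive values, so features
    # containing zeros are left as NA instead of producing -Inf.
    if (all(x > 0)) {
      out$r_log10[i] <- cor(df[[target]], log10(x))
    }
  }
  out
}

# Toy rows, invented for illustration only (NOT the wine data).
toy <- data.frame(
  quality     = c(5, 5, 6, 6, 7, 8),
  chlorides   = c(0.060, 0.050, 0.045, 0.040, 0.035, 0.030),
  citric.acid = c(0.00, 0.30, 0.32, 0.28, 0.36, 0.34)
)

print(compare_raw_and_log(toy))
```

The citric.acid row keeps an NA in the log10 column because it contains a zero, which mirrors the minimum of 0 reported for that feature in the summary.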

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The acidity (wv$pH), sweetness (wv$residual.sugar) and alcohol content (wv$alcohol) are the main driving factors in understanding and assessing the quality of wine. The correlation scores I obtained are as follows:

Features of interest: pH: 0.099; residual.sugar: -0.097; alcohol: 0.436.

Other features: volatile.acidity: -0.194723; fixed.acidity: -0.1136628; chlorides: -0.2099344; citric.acid: -0.009209091; density: -0.3071233; sulphates: 0.05367788; total.sulfur.dioxide: -0.1747372; free.sulfur.dioxide: 0.008158067.

Correlation is an effect size, so we can describe its strength verbally using the guide that Evans (1996) (http://www.statstutor.ac.uk/resources/uploaded/pearsons.pdf) suggests for the absolute value of r: .00-.19 “very weak”; .20-.39 “weak”; .40-.59 “moderate”; .60-.79 “strong”; .80-1.0 “very strong”.
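The Evans guide can be applied mechanically to the r values reported in this section. The sketch below does this with cut(). The function name `evans_label` is made up, and treating each band as a half-open interval (so that, say, 0.195 counts as “very weak”) is an assumption, since the guide leaves the gaps between bands unspecified.

```r
# Map |r| to the verbal labels of the Evans (1996) guide.
evans_label <- function(r) {
  cut(abs(r),
      breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1),
      labels = c("very weak", "weak", "moderate", "strong", "very strong"),
      right = FALSE, include.lowest = TRUE)
}

# r values with quality, copied from the lists above.
r_quality <- c(alcohol = 0.436, density = -0.3071233, chlorides = -0.2099344,
               volatile.acidity = -0.194723, pH = 0.099,
               residual.sugar = -0.097)

labels <- setNames(as.character(evans_label(r_quality)), names(r_quality))
print(labels)
```

By this guide only alcohol reaches “moderate”, density and chlorides are “weak”, and the remaining features listed here are “very weak”.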

After analysing the r values, we see that alcohol, density and chlorides have the largest absolute r values. I plotted the quantile graphs for these features, and they reflect the correlations shown by the r values. The plot of quality against alcohol shows that alcohol content tends to rise as the quality of the wine improves, especially for wines with a quality rating above 6. Density is negatively correlated with quality, with density values decreasing slightly as wine quality increases. Lastly, chlorides show another negative correlation with quality, albeit a very subtle one, with chloride content going down in higher-quality wines.

Of the three features I expected to be the primary determinants of quality, only alcohol has a moderate r score. pH and residual sugar have very weak r scores, and the plots of their relationship with quality confirm this, showing only minor variations that become prevalent only in very high-quality wines.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I analysed the relationship between fixed acidity and volatile acidity (r = -0.022), between fixed acidity and citric acid (r = 0.289), between volatile acidity and citric acid (r = -0.149), and between free sulfur dioxide and total sulfur dioxide (r = 0.615). I also tested the relationship between pH and both fixed and volatile acidity, and found a relationship between pH and fixed acidity (r = -0.425).

Based on this analysis and the plots, we can see that there is a strong relationship between free sulfur dioxide and total sulfur dioxide (r = 0.615), and a moderate negative relationship between pH and fixed acidity (r = -0.425).

What was the strongest relationship you found?

The strongest relationships to quality were:

Alcohol: 0.436; Density: -0.307; Chlorides (log10): -0.272; Volatile Acidity: -0.194; Total Sulfur Dioxide: -0.174; Fixed Acidity: -0.113.

To dive further into this analysis, I re-evaluated the r scores using log10-transformed values to see whether the transformation made any stark difference. The only stark difference was in the correlation with chlorides, which moved from -0.209 to -0.272.

Multivariate Plots Section

Analysis of quality with alcohol content and density

Analysis of quality with pH and residual sugar

Analysis of quality with alcohol content and residual sugar

Analysis of quality with sulphates and density

Analysis of quality with total.sulfur.dioxide and fixed.acidity

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I combined alcohol content and density, the two features with the strongest correlations with quality, to see whether together they separate wines by quality. The plots show that alcohol content generally increased and density generally decreased as the quality of the wine improved.
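A numeric companion to such a plot is the mean of each feature per quality level. The sketch below shows the idea with aggregate() on four toy rows. The data are invented for illustration and are not the wine data, so only the shape of the output is meaningful.

```r
# Toy rows, invented for illustration only (NOT the wine data).
toy <- data.frame(
  quality = c(5, 5, 7, 7),
  alcohol = c(9, 10, 11, 12),
  density = c(0.998, 0.996, 0.992, 0.990)
)

# Mean alcohol and mean density for each quality score.
group_means <- aggregate(cbind(alcohol, density) ~ quality,
                         data = toy, FUN = mean)

print(group_means)
```

On the real data the same call with `data = wv` (or grouping by `rating` instead of `quality`) would show whether mean alcohol rises and mean density falls across the groups.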

Were there any interesting or surprising interactions between features?

I ran some tests to check whether the other features I had expected to matter, pH and residual sugar, could give us some insight. But after generating those plots and facet wrapping by the new variable ‘rating’, it was still hard to say whether these features were significant contributing factors.

After this I ran tests on some of the other features that are not traditionally considered to affect wine quality in the literature and that did not have notable r values with quality. Tests comparing alcohol content and residual sugar to quality didn’t give much insight, beyond a subtle indication that residual sugar tends to go down slightly in higher-quality wines. I performed similar analyses with sulphates and density over quality, and with total.sulfur.dioxide and fixed.acidity over quality. Neither produced enough of a trend to show correlation.


Final Plots and Summary

Plot One

Alcohol and Chlorides

Description One

One of the driving factors of wine quality turns out to be alcohol content, together with the corresponding chloride content (the amount of salt in the wine). Good wines tend to have a higher percentage of alcohol and a lower chloride content, as can be seen in this graph.

Plot Two

Density and Total Sulfur Dioxide

Description Two

Good-quality wines tend to have a low density and a low total sulfur dioxide content. Total sulfur dioxide is the amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine, which is undesirable in good wines.

Plot Three

Acidity in wine

Description Three

Acidity plays a big role in the taste of wine. However, most wines, irrespective of their quality, have a pH in the [3, 4] range. The fixed acids (such as tartaric acid) are the nonvolatile acids that do not evaporate, and the plot suggests higher values in better-quality wines, even though the overall correlation between fixed acidity and quality is very weak and negative (r = -0.114). The volatile acidity, which is the acetic acid that can evaporate, has lower values in good-quality wines. Citric acid, which affects the freshness and flavor of wine, is found in extremely small quantities and is not a determining factor in the quality of wines. All in all, the acidity of a wine plays a subtle but important role in a good-quality wine!


Reflection

When performing this analysis I came to realise that understanding the underlying data plays a key role in extracting useful inferences. In the case of this dataset, it was key to understand the components of wine and the factors that affect its taste and quality. I spent some time reading about how wine is made and the subtle nuances that go into making different kinds of wine. I ran into quite a few difficulties while doing this analysis. For instance, I expected acidity to have a much greater effect on the quality of wine than it actually turned out to have. From that point on I was essentially testing all the other features to understand where correlations might lie. This is a feasible process when there is a small number of features, such as 12 in this case, but it becomes much harder as the number of features goes up. On the other hand, one thing that greatly helped was the Pearson r score. This score helped narrow down my analysis greatly, and I was able to go deeper into some of the more relevant features.

Being a wine sommelier is a hard task, but analyses like these can greatly help. I am curious to perform such tests on other beverages, such as beer and whiskey, as well.